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Abstract 

Markov chain Monte Carlo (MCMC) algorithms provide a very general recipe for 
estimating properties of complicated distributions. While their use has become com- 
monplace and there is a large literature on MCMC theory and practice, MCMC users 
still have to contend with several challenges with each implementation of the algorithm. 
These challenges include determining how to construct an efficient algorithm, finding 
reasonable starting values, deciding whether the sample-based estimates are accurate, 
and determining an appropriate length (stopping rule) for the Markov chain. We de- 
scribe an approach for resolving these issues in a theoretically sound fashion in the 
context of spatial generalized linear models, an important class of models that result in 
challenging posterior distributions. Our approach combines analytical approximations 
for constructing provably fast mixing MCMC algorithms, and takes advantage of re- 
cent developments in MCMC theory. We apply our methods to real data examples, and 
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find that our MCMC algorithm is automated and efficient. Furthermore, since starting 
values, rigorous error estimates and theoretically justified stopping rules for the sam- 
pling algorithm are all easily obtained for our examples, our MCMC-based estimation 
is practically as easy to perform as Monte Carlo estimation based on independent and 
identically distributed draws. 



1 Introduction 



Markov chain Monte Carlo (MCMC) methods have become standard tools in Bayesian in- 
ference and other areas in statistics. When inference is based on some distribution ir, the 
Metropolis-Hastings algorithm provides a general recipe for constructing a Markov chain with 
it as its stationary distribution. From a careful practitioner's standpoint, however, MCMC- 
based inference poses several challenges. In addition to developing and fine-tuning an MCMC 
algorithm that produces accurate sample-based inference quickly, MCMC users also need to 
determine an appropriate length for the Markov chain. These issues pose a challenge to non- 
experts since, even for a specific class of models, the MCMC algorithm needs to be tuned 
carefully to the posterior distribution resulting from each new data set. Also, the commonly 
used MCMC "conv ergence diagnostics " used to determine stopping rules for the algorithm 
may be unreliable ( jCowles and Carlinl . Il996l ) and may not always be directly connected to 
the central goal of MCMC-based inference, which is to estimate properties of the posterior 
distribution ir up to some desired level of accuracy. Furthermore, users may not always have 
a reasonable approach for finding starting values for the algorithm. 

In this paper, we consider approaches for resolving these closely related issues in a rigor- 
ous manner. We describe how we can construct provably efficient MCMC algorithms where 
the MCMC standard errors, which represent the accuracy of sample-based estimates, can be 
estimated consistently. These MCMC standard errors can, in turn, be used in a theoretically 
sound approach to determine an a ppropriate length for the Markov chai n using new develop- 



ments in MCMC output analysis (IFlegal et al. 



2008 



Jones et al. 



20061 ) . Our approach also 



automatically provides reasonable starting values for the MCMC algorithm. Hence, the re- 
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suiting MCMC-based inference is automated, theoretically sound, and practical because: (i) 
initial values for the chain are obtained automatically, (ii) the algorithm produces accurate 
answers quickly and is therefore useful in practice, (iii) an appropriate Monte Carlo sample 
size (length for the Markov chain) is determined automatically, and (iv) the accuracy of the 
sample-based estimates can be assessed rigorously 

A key tool we develop for the samplers described here is an accurate heavy-tailed ap- 
proximation to the posterior distribution of interest. In this paper we describe how one can 
construct such an approximation and investigate it s use in const ructing two samplers: a 
fast mixing independence Metropolis-Hastings chain (iTierneyl . Il994t ) and the classic rejection 
('accept-reject') sampler. We study these algorithms in the context of a popular class of 
spatial generalized linear models. Naive MCMC samplers are known to mix poorly for these 
models, and little is known about the theoretical properties of the sa mplers. These mod - 



els are closely related in form to the geostatistical models described in 



Diggle et al. 



jl998|), 



19991) for 



which are Bayesian versions of generalized linear models (IMcCullagh and Nelder . 
spatial data. Hence, our approach, or variants of it, may be applicable to several other 
important statistical models. Our MCMC algorithm is virtually as simple to use as simple 
Monte Carlo using independent and identically distributed draws. An underlying assumption 
of our work here is that sample-based inference is of interest, for instance when propogating 
uncertainties associated with inferences based on one model or model-component to other 
models, or when modelers are interested in reporting estimates at a level of accuracy that 



they would like to control. W hen sample-based inference is not critic a 



analytical approximations (cf. 



Rue et al 



2009 



Tierney and Kadane . 



, we note that purely 



19861 ) may, of course, 



be completely 'automatic' approaches for approximate inference since they avoid the MCMC 
issues outlined above. 

In Section [2] we provide a general description of the samplers we consider, along with a 
discussion of relevant MCMC theory. In Section [3] we discuss the class of models to which 
we apply our methods. In Section H] we describe a general approach for deriving heavy-tailed 
approximations in hierarchical models, discuss how such approximations can be used to 
construct fast mixing MCMC algorithms, and provide theoretical details. Section describes 
the application of these methods to several real data sets. We conclude with a discussion of 
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our results in Section O 



2 Background 

This section provides a brief overview of the samplers we consider, along with some relevant 
theoretical background. In Section |2~TT1 we provide some basic theory on Markov chain Monte 
Carlo, which sets the stage for the fast mixing MCMC sampler. In Section 12.21 we briefly 
review rejection sampling since we are later able to use our heavy-tailed approximation to 
construct effective rejection samplers in some cases. 



2.1 Markov chain Monte Carlo basics 



Our goal is typically to estimate expectations with respect to a distribution ir. That is, we 
are interested in E n g(x) = J Q g(x)7r(dx), where g is a real-valued ir— integrable function on Q. 
Consider a Harris-ergodic Marko v chain X = |X| , Xo , . . . } with state space Q and stationary 



distribution 7r (for definitions see 
to the ergodic theorem 



Meyn and Tweedid . 



19931 ). If E n \g(x)\ < oo, we can appeal 



1 - 

g n = — / g(Xi) — > E n g(x) with probability 1. 
n ^— ' 

i=l 

Now let P n (x, ■) be the n-step transition kernel for this chain, so that P n (x, A) = P(X i+n G 
A | Xi = x) for x G fl and any measurable set A. Then it also follows that 

lim ||P rt (a;, •) - ir(-)\\ TV = for any x G fi, 

n— >oo 

where the converg ence is in total variation distance, which implies convergence in distribution 



(IBillingsley 



19991 ). However, to estimate standard errors, we need to appeal to a Central 



Limit Theorem (CLT), which does not hold in general. For a CLT to hold, the rate of 
convergence of the Markov chain to 7r is critical. A Markov chain is said to be geometrically 
ergodic if for some constant t G (0, 1) and 7r-almost surely finite function M : — > M. + and 

n G N 

||P n (x,-) -tt(-)||tv < M(x) t n for any xeVt. 
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If M(%) is bounded, then X is uniformly ergodic. If X is uniformly ergodic and E n g 2 < oo 
we have a CLT 

^(j n -M4N(o, ff ;) 

as n — > oo with a 2 = va,i n {g(Xi)} + 2 COY ^ I Q ( X^ ) , q ( Xj) } . ^ For a review of Markov chain 



CLTs under other sets of sufficient conditions see 



Jones (2004) and 



(120041 ) . For a general review of MC MC theory r e 



If the CLT holds as above then 



Jones et al. 



er to 



Tiernevl (I1994h . 



-toberts and Rosenthal 



(120061 ) provide a consistent estimate of a 



2 

<r 



the MCMC standard error in g n , which can then be used to determine how long to run the 
Markov chain. In the 'fixed width' approach to determining chain length, the user stops 
the simulation when the wi dth of a conf i dence interval based on a n ergodic average is less 



than a user-specified value (IFlegal et al. 



2008 



Jones et al 



20061 ). Thus, for a uniformly 



ergodic chain, sample-based inference is comparable to i.i.d. Monte Carlo in many ways: a 
CLT holds under similar conditions, a consistent estimate of Monte Carlo standard errors 
is easily obtained, and the error estimate can be used to determine a stopping rule for the 
sampler. Note that it is challenging to construct uniformly ergodic Markov chains for real 
problems, and it is generally not easy to prove that a given Markov chain is uniformly or 



geometrically ergodic; most such 



conditions (cf. 



proo 



Hobert and Geyerl . 



1998|). 



s have relied on establishing drift and minorization 



We briefly describe the consistent batch means approach to calculating Monte Carlo 
standard errors. Suppose the Markov chain X is run for a total of n = ah iterations (hence 
a and b are implicit functions of n) and define 



Y., 



jb 

i=(j_i) 6+ i 



for j 



The batch means estimate of a 2 g is 



a: 



a-l^ y 3 



9nf 



3=1 



Jones et al. 



( 120061 ) showed that if the batch size and the number of batches are allowed to 



increase as the overall length of the simulation increases by setting b n = \_y/n\ and a n = 
\n/b n \ then a 1 — > a 2 with probability 1 as n — )■ oo. This is called consistent batch means 
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(CBM) to distinguish it from the standard (fixed number of batches) version. This is an 
attractive approach to estimating standard errors since it is easy to compute and holds 
under the regularity conditions that the chain is unif ormly ergo di e and E n \ q \ 2 < oo (though 



these are not the only set of sufficient conditions; see 



Jones et al. 



d2006|) for details). 



2.2 Rejection sampling 



Rejection or 'accept-reject sampling' ( Ivon Neumann! . Il95ll ) is a well established, simple but 



powerful method for generating ran dom variates from a given distribution ir with support Q 



(also see 



Robert and Casella 



20051 ). Assume we have a proposal distribution r so that we 



can draw random samples from r, and we know B such that 

7i(x) 



ess sup 



< B, for some B < oo, 



xen r{x) 

where "ess sup", the essential supremum (the supremum over all but a set of measure zero), 
is taken with respect to n. The accept-reject sampling algorithm is as follows: 

• Draw X ~ r and draw U ~ Uniform(0,l). 



If U < 



AX) 

r(X)B 



return X, else do not return X. 



Values returned by the above algorithm are distributed according to ir. Note that we only 
need to know both ir and r up to a constant of proportionality, that is, we could replace ir(x) 
and r(x) with unnormalized functions h(x) and q(x) where h(x) oc tt(x) and q(x) oc r(x). 
However, in addition to satisfying ([1]), we also need a specific value of B that satisfies this 
condition. These are stringent requirements, and explains why rejection sampling is rarely 
considered a practical option for sample-based inference for realistic Bayesian models. 



3 Generalized linear models for spatial data 

Spatial generalized linear models are very convenient models for spatial data when the 



sampling mechanism is 



mown to be non-G aussian. The spatial dependence can be modeled 



via Gaussian process es (IDiggle et al 



Rue and Held 



19981 ) or Gaussian Markov random fields (GMRFs) (cf. 



20051 ). For brevity, we only describe GMRF-based models for count data as 
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this is the example used later in this paper. Other models such as Gaussian process-based 
models may be specified in analogous fashion. 

Consider the following hierarchical spatial model for areal data (data arising as sums or 
averages over geographic regions): the count in region i, denoted by Y t for i = 1, . . . , N, is 
modeled as a Poisson random variable with mean £ , i exp(/ij). Ei, assumed fixed and known, 
is the count in region i when assuming constant rates for all regions and is typically the 
product of the population of the ith region and the overall rate in the entire study region. 
is the log-relative risk specific to the ith region. Hence the Y^s are modeled as conditionally 
independent random variables, 

Yi |^~Poi(£ ie «), z = l,...,JV, (2) 

with /ij modeled linearly as /ij = Oi + fa, and = (9\, .., 6^) T and <fi = (0i, .., 4>n) t are vectors 
of random effects. The 8^s are independent and identically distributed normal variables, while 
the 4>iS are assumed to follow a GMRF. In this way, each 0j captures the ith region's extra- 
Poisson variability due to area-wide heterogeneity, while each 0« captures the ith region's 
excess variability attributable to regional clustering. These distributions are specified as 
follows: 

6i\r h ~ JV(0, 1/T h ), and fcty^ ~ N{n^,(^),i = 1, N , 
where ^ = ^'"'^ and a* ~ 1 



The Ufa for a region i is thus a weighted average of the clustering parameters in other 
regions. Here we use the most common weighting, where Wij is taken as 1 if regions % and j 
are immediate neighbours, and otherwise. This improper prior can be written as 

where 

rii if % = j 
Qij = \ if % is not adjacent to j 
— 1 if % is adjacent to j 
The model specification is completed by specifying priors on the precision (inverse variance) 
parameters, ~Gamma(a/ i , ft) and r c ~Gamma(a c , fl c ). Inference for this model is based 
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on the 2N + 2 dimensional posterior distribution tt(0, </>, r^, r c | Y), where Y = (Yi, 
The posterior distribution is provided in Append ix lAl 



,Y n ) 



This model was used by 



Besag et al. 



( 19911) f or Ba yesian image restoration, and has 



since become popular in disease mapping ( Molliel. 



hierarchical Bayes models (see for example 



1996). GMR Fs are very widely used in 



Banerjee et al. 



2004] ) . The distributions arising 



from such models are challenging enough so standard MCMC samplers for the posterior 



distributions are known to mix poorly. A variety o 



models have been studied in 



Knorr-Held and Ruel ( 120021 ) and 



imp roved MCMC app r oache s for these 



Haran et al. 



( 120031 ) but their 



theoretical properties are not known. The issues described in Section [T] are hence unresolved: 
determining an appropriate chain length is difficult, and MCMC standard errors of estimates 
are not estimated rigorously. 



4 Constructing automated samplers 



We describe an approach for constructing samplers that are virtually automated for the class 
of spatial generalized linear models we consider in this paper. To achieve this, we first derive 
an approximation for posterior distributions arising from hierarchical models with latent 
Gaussian random fields. A heavy-tailed version of this approximation is then used to create 
a proposal distribution that allows for the construction of the samplers described in Section 

m 

We derive the approximation by first obtaining an approximation for the joint distribution 
in a form that allows for a large number of parameters to be integrated out analytically. The 
method of integrating out model parameters to sample fro m a lower dim e nsional margina l 



distribution has bee n explo r ed by several auth o rs, inc l uding 



Everson and Morris! ( 120001 ) 



Gamerman et al. 



(120031 ): 



Blinded! (12003 



Wolfinger and Kass 



Everson! (j2Q0lh : 



()2000|). In each 



of the hierarchical models considered above, it is possible to analytically obtain the exact 
marginal distribution of the variance components. Furthermore, any MCMC algorithms used 
when sampling from the marginal distributions neither have known theoretical properties, 
nor rigorous estimates of error associated with them. For hierarchical models in general and 
the spatial models considered here in particular, exact formulae are not unavailable for any 
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marginal distributions. 



4.1 A heavy-tailed approximation for hierarchical models 

We now describe our general approach for deriving an approximation for a posterior distri- 
bution of a variance component model. Denote the precision parameters by A, the random 
effects by and the data by Y. Suppose the distribution of interest is vr(0, A| Y) and we can 
derive an approximation 7r(0,A|Y) for which we can analytically obtain the marginal dis- 
tribution of the variance components, si(A|Y), and the conditional distribution, s 2 (0|A, Y). 
Approaches for deriving such a ir involve using a Gaussian approximation to the likelihood, 
as descr ibed in Section I4.2t in gen eral, the Laplace approximations may be useful for this a s 



well (cf. iRobert and Casellal . 120051 ). Our general approach, first described in iBlindedl (120031 ) 



can be summarized as follows: 

• Step 1: Approximate the target posterior, 7r(0,A|Y), as arising from a generalized 
linear model, by 7r(0,A|Y). 

• Step 2: Analytically integrate 7r(0,A|Y) with respect to to obtain the approx- 
imate marginal posterior si(A|Y). 7r(0,A|Y) can then be written as the product 
Si(A|Y)s 2 (0|A, Y) where s 2 is the easily obtained approximate conditional distribu- 
tion of the random effects. 

• Step 3: Find r-y and r 2 such that they are heavier tailed distributions with similar 
shapes to s\ and s 2 respectively. r(0,A|Y) = ri(A| Y)r 2 (0|A, Y) is then a heavy- 
tailed approximation to 7r. If we can demonstrate that r is heavy-tailed with respect to 
Ti, that is, it satisfies ([T]), it can be used to construct samplers with desirable properties. 

We believe the approach outlined above can be used to derive useful approximations for 
several interesting and important Bayesian models. We focus our attention here on showing 
how it can be applied to the models described in Section [3j 
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4.2 A heavy- tailed approximation for a spatial generalized linear 
model 

This subsection provides an outline of the derivation of the approximate marginal distribu- 
tions (si) of the precision parameters for the GMRF model described in Section [31 Details 
are provided in Appendix [A3 A Gaussian approximation for the likelihood (j2J) is 

Y^NiE^^Eie^) (3) 

Let fii be log( Yj/i?,). The delta method gives us the approximation 

fi^NiOi + fal/Yi). (4) 



Note that if Yi is 0, we re place it with 0.5 w 



ren constructing the approximation. A similar 



strategy was described in lHaran et al.l (120031 ) in the context of block MCMC sampling. If 



we denote 6 = (9\, . . . , 6n) t , <f> = (4>i, ■ ■ ■ , 4>n) t and 6 = (6 T ,(f) T ) T we can analytically 
integrate the approximate joint posterior distribution 7r(0, t^, t c \Y) with respect to 9 to 
obtain Si(r/ l , t c |Y), an approximation to the marginal distribution of (t/j,t c ). 

If we let V~ l = Diag(Fi, . . . , Ijy) and fi = . . . , £in) t , the approximate distribution of 
the random effects parameters, conditional on the precision parameters is 



(5) 



where 

V" 1 + r h I + V- 1 



C2NX2N 
and 



v~ l + t c Q 



J D lx2JV = (-2A T V- 1 ,-2A T ^ 1 ). 

We use heavy-tailed distributions r\ and r 2 that have roughly the same shape and scale as s\ 
and S2 respectively. We let r\ = ri ri6, where r ia and ri& are independent log-t distributions. 
Among candidate distributions considered were the gamma, Weibull and log-normal, but the 
log-t was found to be the most appropriate due to its tail behavior and flexibility. r 2 was 
then obtained by using a multivariate-t with the mean and covariance of s 2 (0, 0|t^, t c , Y). 
We can now state the following result. 
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Proposition 1. Let r(Th,r c , 0) = r 1 (r/ l , r c )r2(0 | Th,r c ). Let it be the posterior distribution 
corresponding to the Pois son- Markov random field model in Section 0. Then, 

ir(T hl T c ,Q) 

ess sup — ; — - < n, 

r{T h , t c , 6) 

for some B < oo, with r h > 0, r c > 0, 6 e R 2n . 

Proof. See Appendix [Bj □ 

To specify the parameter values for r\ and r2, we find r\ and r2 that best match s\ and 
S2- This can be done in a number of ways, but a simple and effective approach we used 
was as follows: The profiles of the log-transformed function Si(r^, r c |Y) along Th and r c were 
plotted. The heavy-tailed proposals for the precision parameters were found on the log- 
scale by matching the mode and variance of the t-distributions to the approximate log-scale 
profiles. The corresponding log-t distributions ri a (r/ l ) and rib(r c ) were used jointly as the 
proposal ri(r/ l , r c |Y) for (r^, r c ). It is easy to draw (r^, r*, 6*) from rir 2 : 

1. Draw rl ~ r ia and r* ~ r^. 

2. Conditional on the values drawn above, simulate 0* ~ r 2 (0|r^, r*, Y). 
4.3 Sampling algorithms 

We now provide some details of the samplers constructed based on the approximations ob- 
tained in the previous sections. The ease of producing samples from r, combined with 
Proposition 1, allows us to construct two samplers: 

1. Rejection sampler: r can be used as a proposal distribution in a classical rejection 
sampling algorithm. 

2. Independence Metropolis-Hastings: r can be used as a proposal in an independence 
chain algorithm, where every proposed upda te to the Mark ov chain is obtained from r, 



regardless of the current value of the chain (jTierneyl . Il994l ). In particular, we consider 



an independence chain where the entire state space is updated in a single block. The 
starting value for the Markov chain is obtained by drawing a sample from r. 
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An important result corresponding to the above independence chain is obtained. 



Proposition 2. Consider an independence Metropolis-Hastings algorithm with target density 
7t(t/i, t c , 0) and proposal density r(rh, t c , G) . The resulting Markov chain is uniformly ergodic. 



Proof . Follows directly from Proposition 1 and Theorem 2.1 in iMengersen and Tweedie 



An important consequence of Proposition 2 is that a Central Limit Theorem can be 



shown to hold for the independence Markov chain, and a con sistent estimate o 



standard errors is easily obtained via consistent batch means (I Jones et al.l . 120061 ). as discussed 



Monte Carlo 



in Section 12.11 Furthermore, the Markov ch ain can be stoppe d based on the estimated 
standard errors attaining a desired threshold ( jFlegal et al.l . 120081 1 . Of course, the fact that 
the sampler is uniformly ergodic does not, on its own, imply that it is an efficient sampler in 
practice. We therefore study the efficiency of our indepedence Metropolis-Hastings algorithm 
in a variety of real data situations in Section [5J 

We note that r also satis fies conditions for the construction of a perfect tempering sampler 



( Moller and Nicholls 



20091 ). an algorithm that utilizes simulated 



variant of the Propp- Wilson perfect sampler (IPropp and Wilson!, 



;empe ring to construct a 



1996 



Moller and Nicholls 



). This presents an 



( 120091 ) report an increase 



intriguing possibility for future research since 
in efficiency by using perfect tempering. Ho wever, as also seen in an in depth study of perfect 



tempering for such models in 



Blinded! ( 120031 ). it is challenging to construct a perfect tempering 



algorithm that is consistently more efficient than the much simpler rejection sampler for the 
examples considered here. 



5 Applications to Cancer and Infant Mortality Data 
5.1 Description of data sets 

We consider a total of three data sets: two on cancer in the U.S. state of Minnesota, and 
one on infant mortalities in the United States. The first two Minnesota cancer data sets 
were obtained from the Minnesota Cancer Surveillance System (MCSS), a cancer registry 
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maintained by the Minnesota Department of Health. The MCSS is population-based for the 
state of Minnesota, and collects information on geographic location and stage at detection 
for colorectal, prostate, lung, and female breast cancers. We illustrate our computational 
approaches by analyzing the MCSS data for two of the cancers, breast and colorectal. Each 
of the 87 counties in the data set has associated with it the total number of cancer cases 
recorded between 1995 and 1997, and the number of these detected late. We then take the 
expected number of late detections for that county as the number of cancer cases for that 
county multiplied by the statewide rate of late detections. The question of interest is whether 
there are clusters of counties in the state of Minnesota with much higher than expected late 
detection rates for either cancer. The spatial model provides smoothed estimates of the 
relative risk of cancer cases being detected late in each county. For these data, the posterior 
distribution based on the model in Section [3] has 176 dimensions. 

We also consider a larger data set on infant mortalities in the United States. These 
data are derived from the Bureau of Health Professions Area Resource File, which is a 
county-level database for health analysis. Total number of births and deaths are obtained 
from 1998 to 2000. The geographic units used in this study include all counties of the 
following five states: Alabama, Georgia, Mississippi , North Carolina and South Carolina 



([Health Resources and Services Administration! . 120031 ) . The substantive problem of interest 



with this data is to determine spatial trends in infant mortality, and finding clusters of regions 
with unusually elevated levels in order to study possible socio-economic contributing factors. 
This data set, resulting in a posterior distribution of 910 dimensions, affords an opportunity 
to study the performance of our algorithms in a different and potentially more challenging 
high dimensional setting. 

5.2 Setting up the algorithm 

To find appropriate heavy-tailed distributions, it is useful to look at profiles of the approxi- 
mate marginal posterior distributions (on log-scale) and find t-distributions that match them 
reasonably well. The log-t distributions are obtained by exponentiating these t-distributions. 
Once the log-t distributions are specified, the joint proposal distribution is obtained easily 
according to (jSJ), and both rejection and independence Metropolis-Hastings samplers can be 
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constructed. For the rejection sampler, we find an upper bound B that satisfies ([I]) by nu- 
merically optimizing the ratio of the target and candidate densities. This worked well for 
our example s , thou gh an alternative would be to use the "empirical-sup" rejection sampler 
(jCaffo et al.l . |2002| ) to allow for the value of B to be adaptively estimated by the maxi- 
mum based on the simulated candidates. We note that an advantage of the independence 
Metropolis-Hastings algorithm is the fact that this upper bound B is not needed. 

The rejection sampler based on the heavy-tailed approximation produces a small number 
of accepted samples relative to the number of samples proposed. Similarly, the independence 
M-H algorithm may have low acceptance rates, especially for high dimensional distributions. 
It is therefore important to use efficient computational methods both for drawing the pro- 
posals, and for evaluating the Metropolis-Hastings ratios. The steps that take up most of 
the computational time involve operations on large matrices. Thus, for these samplers to 
be u s eful i n practice it is necessary to exploit the sparseness of these matrices. We follow 



Rud (]200lh by minimizing the bandwidth of the matrices involved and running fast band- 



matrix algorithms to speed up matrix computation. We find that this dramatically speeds 
up computation time and makes rejection sampling practical in some cases, and makes in- 
dependence Metropolis-Hastings efficient and practical in all the examples we consider. The 
computer code was entirely implemented in R ( llhaka and Gentleman! . Il996l ). utilizing appro- 
priate sparse matrix routines written in FORTRAN from the well established online resource 
Netlib flhttp : // www . netlib . org ) . 



5.3 Results 

The efficiency of the algorithms were compared both in terms of the number of samples 
required for the estimates to attain a desired level of accuracy, as well as the total time taken 
by the algorithm before stopping. The same desired level of accuracy was specified for each 
of the parameters in both algorithms. For example, for the breast cancer data set standard 
errors for each of the random effects was set to be no more than 0.01, while it was set to 
be no more than 2 for each of the precision parameters. These results are summarized in 
Table 1. Clearly, the independence M-H sampler is most time efficient for all three data sets. 
The efficiency of the independence M-H algorithm when compared to rejection sampling is 
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Table 1: Comparison of rejection sampler and independence MH (I-MH) algorithms 



data set 


samples required before stopping 


total time taken (seconds) 


rejection 


I-MH 


rejection 


I-MH 


breast cancer 


4,118 


29,241 


2,663s 


183s 


colo-rectal cancer 


4,735 


27,225 


543s 


170s 


infant mortality 




97,721 




10,066s 



in accordance with results in iLiul (Il996f ). For the breast cancer and colo-rectal cancer data 
sets, exact sampling is viable. Note that while the number of samples required is similar 
for the breast cancer and colo-rectal cancer data sets, the time taken to produce samples is 
much longer for the breast cancer data. This is because fewer exact samples were returned 
per unit time for the breast cancer data set. For the infant mortality data set, the rejection 
sampler was unable to attain the desired accuracy level, even after the algorithm was run 
for well over 60 hours. The independence sampler, however, was able to provide estimates 
within 3 hours. Hence, the independence sampler may be practical even when the exact 
sampler is not. We feel it is important to note that the timings above assume that we do not 
utilize any form of parallel computing. Both the rejection sampler and the independence M-H 
sampler are 'embarrassingly' parallel in that each of the draws from the proposal as well as 
the target distribution and proposal evaluations can be done entirely in parallel. Hence, with 
relatively little effort the computational effort can be reduced by a factor corresponding to the 
number of available processors. Hence, for instance, a 3 hour independence M-H algorithm 
would take well under 5 minutes if 50 processors are available. This is of particular interest 
since computer clusters with large numbers of available processors are becoming increasingly 
accessible for scientific computing. 

A motivation behind exploring rejection sampling is the simplicity and rigor of sample- 
based inference: easily computable consistent estimates of standard errors, avoidance of issues 
about determining starting values and not having to rely on ad-hoc approaches for deter- 
mining Markov chain length. While the classic rejection sampler we have constructed based 
on our heavy-tailed approximation works surprisingly well in some cases, it is impractical 
for challenging, high dimensional examples. We find our independence Metropolis-Hastings 
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sampler is still efficient and feasible in such cases, while retaining much of the simplicity of 
exact sampling. In particular, to obtain starting values for our independence chain, we simply 
simulate from our heavy-tailed approximation r. From Proposition 2, this chain is uniformly 
ergodic, and we can easily see from results described in Section |2~T1 that Monte Carlo stan- 
dard errors for different parameters can be estimated consistently, and these standard errors 
can be used to provide a stopping rule for this sampler based on sound principles, just as 
one would for exact samplers. Hence, from the user's perspective, sample-based inference is 
no more complicated than for inference based on the exact samplers. For the data sets and 
models considered here, this sampler resolves all MCMC issues originally raised in Section [TJ 

6 Discussion 

We have demonstrated that it is feasible to implement samplers for realistic hierarchical 
Bayesian models in a manner that permits rigorous estimation of standard errors, while 
avoiding the usual issues regarding the determination of simulation lengths. We focused on 
hierarchical models where we are able to construct accurate heavy-tailed analytical approxi- 
mations. We described a class of models to which we can apply our approximation techniques 
and derive theoretical results. Our general approximation strategy (in Section 14. ip is more 
broadly applicable, as are the central ideas behind automating the starting and stopping 
of the sampler in rigorous fashion. We do not claim that our approach to constructing the 
approximation is the best for any given problem since, for example, other approaches may 
improve upon the proposals constructed for the example outlined in Section HJ Better ap- 
proximations will naturally lead to more efficient samplers. Rather, we believe the purpose 
of this paper is to show that using carefully derived heavy-tailed proposals along with recent 
developments in MCMC theory, sample-based inference can be carried out in a simple, fairly 
automated and rigorous fashion for some models. 

For the examples in this paper we have explored exact sampling via the rejection sampler 
and have successfully demonstrated the applicability of our methods to some real data sets. 
These exact samplers, however, are generally not feasible for higher dimensional problems. A 
fast mixing independence Metropolis-Hastings algorithm is a more efficient alternative, while 
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retaining the rigor and simplicity of inference with exact samplers by guaranteeing that a 
Central Limit Theorem holds, and allowing for simple but consistent estimates of standard 
errors and intuitive and theoretically justified stopping rules. The increased efficiency of the 
independence chain becomes critical in higher dimensional problems, where it is viable even 
when the exact samplers are not. A potential weakness of our independence chain algorithm 
is that it involves block updates of high dimensions, which can lead to low acceptance rates 
as the dimensions of the problem increases. While this is certainly of concern for very large 
dimensional problems, we believe that for moderately high dimensional problems (thousands 
of dimensions), a combination of sparse matrix approaches and the latest high speed parallel 
computing, holds much promise for generating and evaluating high dimensional proposals 
extremely quickly. Such approaches are being explored elsewhere. 

Using a combination of analytical work and modern computing power, our results suggest 
that it may be possible to construct rigorous, nearly automated approaches to sample-based 
inference for some classes of models that are of practical importance. In situations where 
multimodality may be an issue, the heavy-tailed approximation also provides a simple and 
reasonable approach for generating starting values for simulating multiple chains on several 
different processors, since the approximation is genuinely over-dispersed with respect to the 
target distribution. While obtaining theoretical results for each new model using our ap- 
proach may prove to be non-trivial in general, we believe that the methodology outlined 
for constructing accurate heavy-tailed approximations may be generally useful, and our fixed 
width stopping rules may still be valuable in cases where analytical results are hard to obtain. 

A Approximate marginal and conditional distributions 

The full joint posterior distribution from Section [3] is: 



7T 



(0, <f>, r h , t c ) cxexp ^((0i + ^)Yi - E.e 8 ^) exp I --0 T (r h I)e 



) 




0h Pc 



) 
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The approximate joint posterior distribution based on (j3J), fllj) is: 
7r(6, r h , r c \Y) cc exp (-l(jj,-(0 + 4>)) T V-\i* -{6 + </>)) - \o T (r h I)0 - ^<f> T (r c Q)<^ 

xrf^-Vf/^^exp - r c //? c ) 

where A = Qog(Yi/£i) , • • • , log(Y N / E N )) T , M — N — l, V' 1 = Bmg(Y u . . .,Y N ), I is an Nx 
N identity matrix, Q is the adjacency matrix described in Section [3J and ah,<y c , Ph, fi c > 
are hyperparameters of the Gamma density (as in Section |4~2]) . To obtain the approximate 
marginal posterior distribution si(r/ l ,r c |Y) up to a constant of proportionality, we integrate 
7r(0, Th, t c |Y) with respect to to obtain 

Sl (r h , r c \Y) = T ^^-\M/^-i exp { _ Th//3h _ Tc/(3c) 

x (det(r /l ^- 1 + t c V- x Q + r h r c Q)y 1/2 
1 



xexp<( -- ( T C0 + DQ + A; 



with 



a 



2Nx2N 



V" 1 + r h J +V- 1 



and 



x27V 



e 



T 



T ~ T 



(0 ,0 



where = ^Q(I + ^Q + t c VQ)~ 1 (i and ^) = {I +^Q + t c VQ)~ 1 fi. For convenience, we have 
denoted C by C(r^,r c ). Our heavy-tailed approximation for the joint marginal of (t^., t c ) is 
a product of log-t distributions, 



ri(T h ,T c \Y) oc 



TftT c 



1 



-, -K+l)/2 r 



1 



1 / log(r c ) - \i c 



Or 



> c +l)/2 



where // c £ ^ °7n °c > and z/^, z^ c G Z + are tuned to match the approximate marginal 
posterior density s 1 (r/ 1 , r c | Y). Our approximation for the conditional distribution of the 
random effects parameters, 7r(0 | Y) is a multivariate-t version of (131) : 

-O r +27V)/2 

' ,11, 2 1 f 1 + -(0 - n N ) T C{Q - mat) ) 



r 2 (0|r,,r c ,Y) 



| C |0.5 r (*V±27V) / x 
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where /i^ = —C~ 1 D T /2 and v r G Z + . Note that for technical reasons (see Lemma [2]) Q in 
the matrix C is replaced by a positive definite Q matrix. Our heavy-tailed approximation 
for the joint density 7r(0, Th, t c |Y) is 

r(6, r h , r c |Y) = n(r & , r c |Y)r 2 (6|r h , r c , Y). 



B Proofs 



We begin with several lemmas that will be helpful for proving Proposition [TJ We follow the 
notation used in Sections [3] and HI 



Lemma 1. 



N/2 M/2 
T h Tc 



< 



( min Y + Th 



1/2 



\C(T h ,T c )\V* ~ \ minY ) r i/2 n Jv-i A i/2' 
where Ai, . . . , Aa?_i are £/te non-zero eigen-values of Q. 



Proof. 



\C\ 



V~ l + r h I V- 1 
V- 1 V- 1 + r c Q 



V" 1 + r h J V" 1 

\Z- 1 + r c g-y- 1 (V- 1 + r ft /)-V- : 



(6) 



by subtracting a matrix multiple of the first row block from the second. The determinant 
of the right hand side is the product of the determinants of the two diagonal blocks. The 
determinant of the first block is bounded by 



N 



The second diagonal block is 



V- 1 + r c Q - V-'iV- 1 + Th^V- 1 = diag(Y - Y 2 /(Y + r,)) + r c Q = diag (y^r) 



(7) 



T C Q 



Y + r, 

The determinant can be bounded below by replacing the Y$ by their minimum value (minY). 
We can then write Q = UAU T with A = diag(Ai, . . . , A7v)-This gives the lower bound 

N / • V \ 
/ mi n Y t, \ 

(8) 



IV- 1 + r c Q - V~\V~ l + r h )-W- l \ > J] f mi ! Y ^ 



min Y + Th 



> 



i=l 

( minY 



min Y + Th 



N-l 



rurf ] [ A, 



(9) 



i=l 
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Combining (E]) , O , © , we obtain the bound, 



N/2 M/2 
\C(T h ,T c )\V< 



< 



min Y + Th 
min Y 



1/2 



1/2 1-riV-l ,1/2 



T, 



h 



n^ 1 k 



□ 





HN,6 




/XjV — 




= 2 




VN,<f> 





Lemma 2. Let £/je center for the multivariate-t approximation, r 2 , be 

V- 1 + r h I V' 1 V^fi 
V- 1 V- 1 + t c Q V^fi 

Then, the quadratic terms ||r c /ijv,0|| IK/i/^iv.ell are bounded, 

WtcVnA ^^WQ^W II^(y 2 )|| hah 

\\r h ^N,e\\ < 2 IIAII- 

Proof. We will use a non-singular version of the Q matrix. Let Q = Q + 57, for a small 5 > 
in computing fijy. This does not change the model, and is necessary only for the center, not 
the spread of the approximation. Using a partitioned form of the inverse, we can write 



c- 1 



with 



A -AV-^V' 1 + t c Q)~ 1 

-bv-\v- 1 + T h iy l B 



A = [V- 1 + r h I - V~\V- 1 + TcQ^V' 1 ]- 1 
B = [r c Q + V' 1 - V-\V- 1 + ThiyW' 1 ]- 1 



Since V 1 = diag(Y), B simplifies to 



and 



Now 



B 



2 \ 1-1 



t c Q + diag Y 



c- 1 



Y + r h 



(j, N><j> = 2B [I - dia 



.4 

-B diag (^- h 
Y 



Y + r h 



t c Q + diag 



Y + T h 



_ AV -i(y-i + Tc Q)-i 
B 



Th y 2 

diag(Y)/} = B diag ( — - — ] 
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To see that 1 1 /x jv,^, 1 1 is bounded, note that 

B- 1 > diag h 



r h Y 
Y + r h 

in non-negative definite matrix order and therefore 



|^jv,<a|| < 2 



diag 



r h Y 
Y + r h 



2||diae(Y 



To bound ||T c /iAr J|, we can similarly use 



and 



to conclude that 



B- 1 > r c Q 



diag ( ) < diagfY 2 ^ 



ll^^ll < 2HQ- 1 !! ||diag(Y 2 )|| 
The component /i^g is given by 

fi Nfi = 2A[I - V~\V~ l + T c Q)- l ]V- l ji = 2A[V~ l - V^iV" 1 + t c Q)' V" 1 ]// 

To bound ||/xjve|| an d 1 1 Thf^we || we can use the inequalities 



A< [V- l V-\V- 1 + t c Q)- 1 V 



1ta-11-1 



and 



Using the first inequality 



A < —I 

Tr 



me\\ < 2|| [V^V-^V- 1 + r^-V- 1 ]- 1 ^- 1 - V' 1 ^ 1 + t c Q)- x V~Y4 = 2\\(x 



and using the second inequality 



\ThHN,e\\ ^ 2T ft 



1 

— A* 



□ 
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Lemma 3. 

N 



exp ( £((0, + m - Eke*** - ^9 T 9 - ^ T 



. i=l 

is bounded. 



i + -(Q-fi N )c(e-^ N ) 



(u r +2N)/2 



Proof. First note that for A, 5 > we have 



1 + A + B=(1 + A)[1 + 



B 



1 + A 

It is therefore sufficient to bound each of the three terms 



< (1 + A){1 + B) 



exp 



(1 \ ( u r 

i + -(e-fi N )cr*(e-ii N )) 

Tc fQ<P) ( 1 + -(© - Hn)C***(® - fJL N ) 



(v r +2N)/2 



(v r +2N)/2 



(v r +2N)/2 



(10) 



'111 



(12) 



C* 



v- 1 v- 1 
v- 1 v- 1 



c* 



T h I 





c* 



where 

r o 

t c Q 

To bound ( llOjl . reparameterize in terms of W{ = 9i + <pi and = ^ — 0*. Then the 
quadratic form in fflUj) can be written as 

V- 1 v- 1 
V- 1 r- 1 



(e - c*(e - p s ) = [(w - niy (v - ay ] 



1 i 

2 2 



1 _ 1 

2 2 



1 1 

2 2 

1 1 



N 



*\2 



i=l 



with 







"ij if " 

2 J 2 J 






1/ _!/ 
2 2 _ 



This does not depend on v. So the quadratic form in ( flOj) is bounded by 



iv 



(6 - fi N ) T C*(Q - fi N ) < Kt + K 2 Yi 



W; 



1=1 
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for some constants K\ and K 2 . The term ffTUl) can thus be bounded by 



N 



'l + Ki) {u r+ 2N),2 J 



i=l 



exp(wiYi - Eie w >) [l + ^Y^ 



(v r +2N)/2 



None of these terms involves Th or r c and since Y{ > each term 



A" 



expiwiY, - E ie w ') [ 1 + ^YiW 2 



(v r +2N)/2 



is bounded in Wi. So (ITUl) is bounded. Using Lemma [2], the quadratic forms in (TTTj) and (fl2"l) 
are bounded by 

(0 - ^) T C**(6 - = r h (9 - fi N , 9 ) T (6 - me ) < K 3 + K 4 r h 6 T 6 



and 



(6 - fi N y C***(Q - fl N ) = T c {<f> - LIN,*) 1 Q{<t> - lis,*) < K $ + K ^< 



lT 



for some constants K 3 , K4, K 5 , and K 6 . Thus term ffTTl) is bounded by 

(v r +2N)/2*\ 



(l + A- 3) ™ 8= up{exp(-|)(l + ^)' 

and term f|T2|) is bounded analogously. 

We can now utilize the above lemmas to prove Proposition [TJ 



< 00 



□ 



Proposition^ The ratio nfa, t c , 6)/r(r^, r c , 6) can be written as 

N 

+ mi - - ^e T e - T ^f 

2 



i + -(e-fi N )c(e-fi N ) 



{u r +2N)/2 



Th T c 



1 , j_ /'log(Tfc) - //ft 



XT 



JV/2+«A_M/2+a c 



det(C) 



-0.5 



(u h + l)/2 r 



j_ / l0g(T c ) ~ ^ 



(fc+l)/2 



The first term is bounded by Lemma [31 We can apply Lemma [T] to bound the determinant 
in the third term. Hence, it follows that the above ratio (ignoring constants) is bounded by 



1 , J_ ( log(r h ) - ///, 



"a 



K+l)/2 



1 , j_ / log(r c ) - /i ( 



2 -i (f c +l)/2 



XT /i r c 



min Y + r/ t 
min Y 



1/2 



.1/2 : 



which is bounded as long as > 1. 
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